We propose a method for annotating the location of objects in ImageNet. Traditionally,
this is cast as an image window classification problem, where each window is
considered independently and scored based on its appearance alone. Instead, we propose
a method which scores each candidate window in the context of all other windows in
the image, taking into account their similarity in appearance space as well as their
spatial relations in the image plane. We devise a fast and exact procedure to optimize our
scoring function over all candidate windows in an image, and we learn its parameters
using structured output regression. We demonstrate on 92000 images from ImageNet
that this significantly improves localization over recent techniques that score windows in
isolation [E3,E3].

1 Introduction


The ImageNet database [0] contains over 14 million images annotated by the class label
of the main object they contain. However, only a fraction of them have bounding-box
annotations (10%). Automatically annotating object locations in ImageNet is a challenging
problem, which has recently drawn attention [O, IIB, E3]. These annotations could be used
as training data for problems such as object class detection [H], tracking [O] and pose
estimation [i]. Traditionally, object localization is cast as an image window scoring problem,
where a scoring function is trained on images with bounding-boxes and applied to those
without. The image is first decomposed into candidate windows, typically by object proposal
generation [□, O, IZ3, E3]. Each window is then scored by a classifier trained to discriminate
instances of the class from other windows [S, O, O, O, E3, El, EE] or a regressor trained
to predict their overlap with the object [0, ED, E3, E3]. Highly scored windows are finally
deemed to contain the object. In this paradigm, the classifier looks at one window at a time,
making a decision based only on that window's appearance.

We believe there is more information in the collection of windows in an image. By 
taking into account the appearance of all windows at the same time and connecting it to their 
spatial relations in the image plane, we could go beyond what can be done by looking at one 
window at a time. Consider the baseball in fig. 1(a). For a traditional method to succeed, 
the appearance classifier needs to score the window on the baseball higher than the windows 
containing it. The container windows cannot help except by scoring lower and being discarded.
By considering one window at a time with a classifier that only tries to predict whether it
covers the object tightly, one cannot do much more than that. The first key element of our
work is to predict richer spatial relations between each candidate window and the object to
be detected, including part and container relations. The second key element is to employ
these predictions to reason about relations between different windows. In this example, the
container windows are predicted to contain a smaller target object somewhere inside them,
and thereby actively help by reinforcing the score of the baseball window. Fig. 1(b) illustrates
another example of the benefits of analyzing all windows jointly. Several windows which
have high overlap with each other and with the wolf form a dense cluster in appearance
space, making it hard to discriminate the precise bounding-box by its appearance alone.
However, the precise bounding-box is positioned at an extreme point of the cluster, on its
tip. By considering the configuration of all the windows in appearance space together we
can reinforce its score.

Figure 1: Connecting the appearance and window position spaces. (a) A window tight on the
baseball (green star in the appearance space plot) and some larger windows containing it (red circles
in the appearance space). Black points in appearance space represent all other candidate windows.
(b) All windows with high overlap with the wolf are shown in red, both in the image and in appearance
space. The ground-truth bounding-box to be found is shown in green. The appearance space plots
are actual datapoints, representing windows in a 3-dimensional Associative Embedding of SURF
bag-of-words. Please see main text for discussion.

In a nutshell, we propose to localize objects in ImageNet by scoring each candidate
window in the context of all other windows in the image, taking into account their similarity in
appearance space as well as their spatial relations in the image plane. To represent spatial
relations of windows we propose a descriptor indicative of the part/container relationship
of the two windows and of how well aligned they are (sec. 2). We learn a window
appearance similarity kernel using the recent Associative Embedding technique [E3] (sec. 3).
We describe each window with a set of hyper-features connecting the appearance similarity
and spatial relations of that window to all other windows in the same image. These
hyper-features are indicative of the object's presence when the appearance of a window alone is not
enough (e.g. fig. 1). These hyper-features are then linearly combined into an overall scoring
function (sec. 4). We devise a fast and exact procedure to optimize our scoring function over
all candidate windows in a test image (sec. 4.1), and we learn its parameters using structured
output regression [EZD] (sec. 4.2).

We evaluate our method on a subset of ImageNet containing 219 classes with more than 
92000 images [O, IE3, E3]. The experiments show that our method outperforms a recent 
approach for this task [E3], an MKL-SVM baseline [El] based on the same features, and 
the popular UVA object detector [E3]. The remainder of the paper is organized as follows. 
Sec. 2 and 3 introduce the spatial relation descriptors which we use in sec. 4 to define our 
new localization model. In sec. 5 we review related work. Experiments and conclusions are 
presented in sec. 6. 
Figure 2: Spatial relations $\rho(w, w')$ between two windows. (a) The first element indicates how much
$w$ and $w'$ overlap, in the traditional PASCAL VOC sense [E3]. The second element indicates whether
window $w$ is a part of $w'$. The third element measures whether $w$ is a container of $w'$. (b) Computing
the spatial relations $\rho(w_i, w_l)$ and $\rho(w_j, w_l)$ for hyper-features $\Phi_G$ (3) and $\Phi_L$ (4).

2 Describing spatial relations between windows 


Candidate windows. Object class detectors have recently been moving away from the
sliding-window paradigm and operate instead on a relatively small collection of candidate
windows [□, O, [II, 0,123, E3, E3, EB] (also called 'object proposals'). The candidate window
generators are designed to respond to objects of any class, and typically just 1000 to 2000
candidates are sufficient to cover all objects in a cluttered image [□, O, IZ3, E3]. Given a test
image, the object localization task is then to select a candidate window covering an instance
of a particular class (e.g. cars). Following this trend, we generate about 1000 candidate
windows $W = \{w_i\}_{i=1}^{N}$ using the recent method [IZ3].


Spatial relation descriptor. We introduce here a representation of the spatial relations
between two windows $w$ and $w'$, which we later use in our localization model (sec. 4). We
summarize the spatial relation between windows $w$ and $w'$ using the following spatial
relation descriptor (fig. 2(a))

$$\rho(w, w') = \left( \frac{w \cap w'}{w \cup w'},\; \frac{w \cap w'}{w},\; \frac{w \cap w'}{w'} \right) \tag{1}$$

where the $\cap$ operator indicates the area of the intersection between the windows, $\cup$ the
area of their union, and a window in a denominator stands for its own area. The descriptor
captures three different kinds of spatial relations. The first
is the familiar intersection-over-union (overlap), which is often used to quantify the accuracy
of an object detector [ffl, O]. It is 1 when $w = w'$, and decays rapidly with the misalignment
between the two windows. The second relation measures how much of $w$ is contained inside
$w'$. It is high when $w$ is a part of $w'$, e.g. when $w'$ is a car and $w$ is a wheel. The third
relation measures how much of $w'$ is contained inside $w$. It is high when $w$ contains $w'$, e.g.
$w'$ is a snooker ball and $w$ is a snooker table. All three relations are 0 if $w$ and $w'$ are disjoint
and are 1 if $w$ and $w'$ match perfectly. Hence the descriptor is indicative of part/container
relationships of the two windows and of how well aligned they are.
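
The descriptor of eq. (1) is simple to compute for axis-aligned windows. The following is a minimal Python sketch, assuming windows are stored as `(x_min, y_min, x_max, y_max)` tuples; the names `Box`, `area` and `spatial_relation` are illustrative choices and do not come from the original work.

```python
from typing import Tuple

# Assumed convention for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def area(b: Box) -> float:
    """Area of an axis-aligned box; empty or inverted boxes have area 0."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def spatial_relation(w: Box, w2: Box) -> Tuple[float, float, float]:
    """Return (overlap, part, cont) for the ordered pair (w, w2), as in eq. (1)."""
    inter_box = (max(w[0], w2[0]), max(w[1], w2[1]),
                 min(w[2], w2[2]), min(w[3], w2[3]))
    inter = area(inter_box)
    union = area(w) + area(w2) - inter
    if inter == 0.0 or union == 0.0:
        # Disjoint (or degenerate) windows: all three relations are 0.
        return (0.0, 0.0, 0.0)
    return (inter / union, inter / area(w), inter / area(w2))


# A small "wheel" window lying entirely inside a larger "car" window.
wheel = (10.0, 10.0, 20.0, 20.0)
car = (0.0, 0.0, 40.0, 20.0)
print(spatial_relation(wheel, car))  # (0.125, 1.0, 0.125)
```

The wheel is fully inside the car, so the part relation is 1 even though the overlap is low; swapping the arguments swaps the part and container entries.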


Vector field of window relations. Relative to a particular candidate window $w_l$, we can
compute the spatial relation descriptor to any window $w$. This induces a vector field $\rho(\cdot, w_l)$
over the continuous space of all possible window positions. We observe the field only at the
discrete set of candidate windows $W$. A key element of our work is to connect this field of
spatial relations to measurements of appearance similarity between windows. This
connection between position and appearance spaces drives the new components in our localization
model (sec. 4).

VEZHNEVETS AND FERRARI: LOOKING OUT OF THE WINDOW


3 Predicting spatial relations with the object 

A particularly interesting case is when w' is the true bounding-box of an object w*. For 
the images in the training set, we know the spatial relations p(w,w*) between all candidate 
windows w and the bounding-box w*. We can use them to learn to predict the spatial relation 
between candidate windows and the object from window appearance features x in a test 
image, where ground-truth bounding-boxes are not given. 

Following [E3], we use Gaussian Process (GP) regression [IZ3] to learn to predict a
probability distribution $P(\rho^r(w, w^*) \mid x) \sim GP(m(x), K(x, x'))$ for each spatial relation
$r \in \{\text{overlap}, \text{part}, \text{cont}\}$ given window appearance features $x$. We use zero mean $m(x) = 0$
and learn the kernel (covariance function) $K(x, x')$ as in [E3]. This kernel plays the role of
an appearance similarity measure between two windows. The GP learns kernel parameters
so that the resulting appearance similarity correlates with the spatial relation to be predicted,
i.e. so that two windows which have a high kernel value also have a similar spatial relation to
the ground-truth. We will use the learnt $K(x, x')$ later in sec. 4.

For a window $w_i$ in a test image, the GP predicts a Gaussian distribution for each relation
descriptor. We denote the means of these predictive distributions as
$\mu(x_i) = (\mu^{\text{overlap}}(x_i), \mu^{\text{part}}(x_i), \mu^{\text{cont}}(x_i))$, and their standard deviation as $\sigma(x_i)$.
The standard deviation is the same for all relations, as we use the same kernel and inducing points.
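
The last remark can be made concrete with the textbook predictive equations of an exact, zero-mean GP regressor. This is only an illustration under stated assumptions: $\mu^r(x_i)$ and $\sigma(x_i)$ denote the predictive mean and standard deviation for window $w_i$, while $K_{tr}$ (kernel matrix of the training windows), $k_i$ (kernel values between $x_i$ and the training windows), $y^r$ (training targets for relation $r$) and $\sigma_n^2$ (noise variance) are symbols introduced here. The method described above uses inducing points, so its exact expressions differ.

```latex
\mu^r(x_i) = k_i^\top \left( K_{tr} + \sigma_n^2 I \right)^{-1} y^r ,
\qquad
\sigma^2(x_i) = K(x_i, x_i) - k_i^\top \left( K_{tr} + \sigma_n^2 I \right)^{-1} k_i .
```

Only the mean depends on the targets $y^r$. The variance is a function of the kernel and the inputs alone, so regressors that share a kernel share the same $\sigma(x_i)$ across the three relations; a similar argument applies to common inducing-point approximations.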


4 Object localization with spatial relations 

We are given a test image with (a) a set of candidate windows $W = \{w_i\}_{i=1}^{N}$; (b) their
appearance features $X = \{x_i\}_{i=1}^{N}$; (c) the means $M = \{\mu(x_i)\}_{i=1}^{N}$ and standard deviations
$\Sigma = \{\sigma(x_i)\}_{i=1}^{N}$ of their spatial relations with the object bounding-box, as predicted by the GP;
(d) the appearance similarity kernel $K(x_i, x_j)$ (sec. 3).

Let $w_l \in W$ be a candidate window to be scored. We proceed by defining a set of
hyper-features $\Phi(X, W, M, l)$ characterizing $w_l$, and then define our scoring function through them.


Consistency of predicted & induced spatial relations $\Phi_C$

$$\Phi_C^r(X, W, M, l) = \max_i \left| \rho^r(w_i, w_l) - \mu^r(x_i) \right| \tag{2}$$


Assume for a moment that $w_l$ correctly localizes an instance of the object class.
Selecting $w_l$ would induce spatial relations $\rho^r(w_i, w_l)$ to all other windows $w_i$. The hyper-feature
$\Phi_C$ checks whether these induced spatial relations are consistent with those predicted by the GP
based on the appearance of the other windows ($\mu^r(x_i)$). If so, that is a good sign that $w_l$
is indeed the correct location of the object. More precisely, the hyper-feature measures the
disagreement between the induced $\rho^r(w_i, w_l)$ and predicted $\mu^r(x_i)$ on the window $w_i$ with
the largest disagreement. Fig. 3 illustrates it on a toy example. The maximum disagreement,
instead of a more intuitive mean, is less influenced by disagreement over background
windows, which are usually predicted by the GP to have small, but non-zero relations to the object.
It focuses better on the alignment of the peaks of the predicted and observed
measurements of the vector field $\rho^r(\cdot, w_l)$, which is more indicative of $w_l$
being a correct localization.
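
As a small illustration of the maximum-disagreement idea described above, the sketch below computes the consistency value for one relation type and one candidate window. It assumes the induced relations and the GP means are already available as plain lists; the function name `consistency_feature` and the toy numbers are hypothetical.

```python
from typing import Sequence


def consistency_feature(induced: Sequence[float], predicted: Sequence[float]) -> float:
    """Largest absolute gap between induced and GP-predicted relations.

    induced[i]   : relation that selecting the candidate window would induce on window i
    predicted[i] : GP mean predicted from the appearance of window i
    """
    if len(induced) != len(predicted) or len(induced) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return max(abs(r - m) for r, m in zip(induced, predicted))


# Toy example with three windows and a single relation type.
induced = [1.0, 0.5, 0.0]
predicted = [0.75, 0.125, 0.25]
print(consistency_feature(induced, predicted))  # 0.375
```

A small value means that the relations induced by the candidate agree everywhere with what the appearance of the other windows suggests.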
Figure 3: Hyper-features $\Phi_C$ for a window $w_l$ are computed by finding the maximum disagreement
between spatial relations that are predicted by the GP, $\mu^r(x_i)$, and induced by spatial relations $\rho^r(w_i, w_l)$
for all other windows $w_i$ in the image. The figure illustrates this on a toy example with three windows
$w_l, w_1, w_2$. For each $r \in \{\text{overlap}, \text{part}, \text{cont}\}$ the pairs of $\mu^r(x_i)$ and $\rho^r(w_i, w_l)$ with maximum
disagreement are highlighted in red.

Global spatial relations & appearance $\Phi_G$

$$\Phi_G^r(X, W, l) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left| \rho^r(w_i, w_l) - \rho^r(w_j, w_l) \right| \cdot K(x_i, x_j) \tag{3}$$


This hyper-feature reacts to pairs of candidate windows $(w_i, w_j)$ with similar appearance
(high $K(x_i, x_j)$) but different spatial relations to $w_l$. Two windows $w_i, w_j$ contribute
significantly to the sum if they look similar (high $K(x_i, x_j)$) and their spatial relations to $w_l$ are
different (high $|\rho^r(w_i, w_l) - \rho^r(w_j, w_l)|$).

A large value of $\Phi_G$ indicates that the vector field of the spatial relations to $w_l$ is not
smooth with respect to appearance similarity. This indicates that $w_l$ has a special role in the
structure of spatial and appearance relations within that image. By measuring this pattern, $\Phi_G$
helps the localization algorithm to select a better window when the information contained
in the appearance features of $w_l$ alone is not enough. For example, a window $w_l$ tightly covering
a small object such as the baseball in fig. 1(a) has high $\Phi_G$ because other windows
containing it often look similar to windows not containing it. In this case, a high value of $\Phi_G$ is a
positive indication for $w_l$ being a correct localization. On the other hand, a window $w_l$ tightly
covering the wolf in fig. 1(b) has low $\Phi_G$ because windows that overlap with it are all
similar to each other in appearance space. In this case, this low value is a positive indication
for $w_l$ being correct. In which direction to use this hyper-feature is left to the learning of its
weight in the full scoring function, which is separate for each object class (sec. 4.2).
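
The pairwise sum of eq. (3) can be sketched as follows for a single relation type and a single candidate window. This is an illustrative implementation rather than the authors' code: the kernel is assumed to be given as a symmetric list-of-lists, and the name `global_feature` is hypothetical.

```python
from typing import Sequence


def global_feature(rel_to_l: Sequence[float], K: Sequence[Sequence[float]]) -> float:
    """Sum over window pairs of |relation difference| times appearance similarity.

    rel_to_l[i] : spatial relation between window i and the candidate window
    K[i][j]     : appearance similarity between windows i and j
    """
    n = len(rel_to_l)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):  # each unordered pair is counted once
            total += abs(rel_to_l[i] - rel_to_l[j]) * K[i][j]
    return total


rel_to_l = [1.0, 0.5, 0.0]
K = [[1.0, 0.5, 0.25],
     [0.5, 1.0, 0.5],
     [0.25, 0.5, 1.0]]
print(global_feature(rel_to_l, K))  # 0.75
```

The double loop makes the quadratic cost per candidate window explicit, which is what motivates a cheaper bound during inference.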


Local spatial relations & appearance $\Phi_L$

$$\Phi_L^r(X, W, l) = \sum_{i=1}^{N} \rho^r(w_i, w_l) \cdot K(x_i, x_l) \tag{4}$$

This hyper-feature is analogous to $\Phi_G$, but focuses around $w_l$ in appearance space. It is
indicative of whether windows that look similar to $w_l$ (high $K(x_i, x_l)$) are also similar in position
in the image, i.e. their spatial relation $\rho^r(w_i, w_l)$ to $w_l$ is close to 1.
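
One plausible reading of this local hyper-feature is a kernel-weighted sum of the relations to the candidate window. The sketch below follows that reading; the exact form, any normalization, and the name `local_feature` are assumptions made for illustration.

```python
from typing import Sequence


def local_feature(rel_to_l: Sequence[float], k_to_l: Sequence[float]) -> float:
    """Kernel-weighted sum of spatial relations to the candidate window.

    rel_to_l[i] : spatial relation between window i and the candidate window
    k_to_l[i]   : appearance similarity between window i and the candidate window
    """
    if len(rel_to_l) != len(k_to_l):
        raise ValueError("inputs must have the same length")
    return sum(r * k for r, k in zip(rel_to_l, k_to_l))


# Windows that look like the candidate (high similarity) and sit close to it
# (relation near 1) contribute the most.
rel_to_l = [1.0, 0.5, 0.0]
k_to_l = [1.0, 0.5, 0.25]
print(local_feature(rel_to_l, k_to_l))  # 1.25
```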

Window classifier score $\Phi_S$. The last hyper-feature is the score of a classifier which
predicts whether a window $w_l$ covers an instance of the class, based only on its appearance
features $x_l$. Standard approaches to object localization typically consider only this cue [B,
Q, O, O, E3, El, EB]. In principle, we could use any such method as the score $\Phi_S$ here. In
practice, we reuse the GP prediction of the overlap of a window with the object bounding-box
as $\Phi_S$. One possibility would be to simply use the mean predicted overlap $\mu^{\text{overlap}}(x_l)$.
However, as shown in [E3], it is beneficial to take into account the uncertainty of the predicted
overlap, which is also provided by the GP as the standard deviation $\sigma(x_l)$ of the estimate:

$$\Phi_S(x_l) = \mu^{\text{overlap}}(x_l) - \sigma(x_l) \tag{5}$$

Using this hyper-feature alone would correspond to the method of [E3].

Complete score. Let

$$\Phi(X, W, l) = \left[ \{\Phi_G^r(X, W, l)\}_r,\; \{\Phi_L^r(X, W, l)\}_r,\; \{\Phi_C^r(X, W, M, l)\}_r,\; \Phi_S(x_l) \right]$$

be a concatenation of all hyper-features defined above for a particular candidate window $w_l$,
over all three possible relations: $r \in \{\text{overlap}, \text{part}, \text{cont}\}$. This amounts to 11 features in
total. We formulate the following score function

$$E(\alpha, X, W, l) = \langle \alpha, \Phi(X, W, l) \rangle \tag{6}$$

where $\langle \cdot, \cdot \rangle$ is the scalar product. Object localization translates to solving
$\hat{l} = \arg\max_l E(\alpha, X, W, l)$ over all candidate windows in the test image. The vector of scalars
$\alpha$ parametrizes the score function by weighting the hyper-features (possibly with a negative
weight). We show how to efficiently maximize $E$ over $l$ in sec. 4.1 and how to learn $\alpha$ in
sec. 4.2.


4.1 Fast inference 


We can maximize the complete score (6) over $l$ simply by evaluating it for all possible $l$ and
picking the best one. The most computationally expensive part is the hyper-feature $\Phi_G$ (3).
For a given $l$, it sums over all pairs of candidate windows, which requires $O(N^2)$ subtractions
and multiplications. Thus, a naive maximization over all $l$ costs $O(N^3)$.

To simplify the notation, here we write the score function with only one argument, $E(l)$,
as the other arguments $\alpha, X, W$ are fixed during maximization. Note that
$0 \le |\rho^r(w_i, w_l) - \rho^r(w_j, w_l)| \le 1$, which, for a non-negative kernel $K$, gives

$$\alpha_G^r \left| \rho^r(w_i, w_l) - \rho^r(w_j, w_l) \right| \cdot K(x_i, x_j) \;\le\; \begin{cases} 0, & \alpha_G^r \le 0 \\ \alpha_G^r K(x_i, x_j), & \alpha_G^r > 0 \end{cases} \tag{7}$$

where $\alpha_G^r$ is the weight of hyper-feature $\Phi_G^r$. By substituting the elements in the sum over
pairs in (3) with the bound (7), we obtain an upper bound on the $\Phi_G^r$ term of the score (in
fact three bounds, one for each $r$). We can then obtain an upper bound $\bar{E}(l) \ge E(l)$ on the
full score by computing all other hyper-features and adding them to the bounds on $\Phi_G^r$. This
upper bound $\bar{E}(l)$ is fast to compute, as (7) only depends on appearance features $X$, not on $l$,
and computing the other hyper-features for a given $l$ is linear in the number of windows $N$.

We use the bound $\bar{E}$ in an early rejection algorithm. We form a queue by sorting windows
in descending order of $\bar{E}(l)$. We then evaluate the full score $E(l)$ of the first $l$ in the queue
and store it as the current maximum. We then go to the next $l$ in the queue. If its upper
bound $\bar{E}(l)$ is smaller than the current maximum, then we discard it without computing its
full score. Otherwise, we compute $E(l)$ and set it as the current maximum if it is better than
the previous best one. We iteratively go through the queue and at the end return the current
maximum (which is now the best over all $l$). Note that the proposed fast inference method
is exact: it outputs the same solution as brute force evaluation of all possible $l$.
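
The early-rejection procedure can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: it handles a single relation type, assumes a non-negative kernel, and takes the weighted sum of all cheap hyper-features as a precomputed list `cheap`. The names `phi_g`, `fast_argmax`, `cheap` and `alpha_g` are hypothetical.

```python
from typing import Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def phi_g(rel_to_l: Sequence[float], K: Matrix) -> float:
    """Expensive pairwise term for one relation type and one window l."""
    n = len(rel_to_l)
    return sum(abs(rel_to_l[i] - rel_to_l[j]) * K[i][j]
               for i in range(n) for j in range(i + 1, n))


def fast_argmax(rel: Matrix, K: Matrix, cheap: Sequence[float],
                alpha_g: float) -> Tuple[int, float, int]:
    """Exactly maximise E(l) = cheap[l] + alpha_g * phi_g(l) with early rejection.

    rel[i][l] : spatial relation between window i and candidate window l
    cheap[l]  : weighted sum of all other hyper-features (assumed precomputed)
    Returns (best index, best score, number of full score evaluations).
    """
    n = len(cheap)
    # Each pair contributes at most max(alpha_g, 0) * K[i][j], assuming K >= 0.
    pair_sum = sum(K[i][j] for i in range(n) for j in range(i + 1, n))
    bounds = [cheap[l] + max(alpha_g, 0.0) * pair_sum for l in range(n)]
    queue = sorted(range(n), key=lambda l: bounds[l], reverse=True)

    best_l, best_score, evaluated = -1, float("-inf"), 0
    for l in queue:
        if bounds[l] < best_score:
            # The queue is sorted, so every remaining bound is too small as well.
            break
        score = cheap[l] + alpha_g * phi_g([rel[i][l] for i in range(n)], K)
        evaluated += 1
        if score > best_score:
            best_l, best_score = l, score
    return best_l, best_score, evaluated


rel = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 1.0]]
K = [[1.0, 0.5, 0.25], [0.5, 1.0, 0.5], [0.25, 0.5, 1.0]]
cheap = [0.0, 1.0, 0.25]
print(fast_argmax(rel, K, cheap, alpha_g=-1.0))  # (1, 0.5, 1)
```

In this toy run only one of the three windows needs a full evaluation. Because the queue is sorted by the bound, stopping at the first rejected window is equivalent to discarding the remaining windows one by one.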
4.2 Learning $\alpha$ with structured output regression

The scoring function (6) is parametrized by the weight vector $\alpha$. We learn $\alpha$ from the
training data of each class using a structured output regression formulation [0, O, ED, E3].
Ideally, we look for $\alpha$ so that, for each training image $I$, the candidate window $w_{l_I^*}$ that best
overlaps with the ground-truth bounding-box has the highest score. It is also good to
encourage the score difference between the best window $w_{l_I^*}$ and any other window $w_l$ to
grow as their overlap decreases. This makes the learning problem smoother and better behaved
than when using a naive 0/1 loss which equally penalizes all windows other than $w_{l_I^*}$. Hence,
we use the loss $\Delta(l, l_I^*) = 1 - \rho^{\text{overlap}}(w_l, w_{l_I^*})$ proposed by [0], which formalizes this
intuition. We can find $\alpha$ by solving the following optimization problem

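$$\min_{\alpha, \xi} \; \frac{1}{2} \|\alpha\|^2 + \gamma \sum_I \xi_I$$

$$\text{s.t.} \quad \xi_I \ge 0, \;\; \forall I; \qquad E(\alpha, X_I, W_I, l_I^*) - E(\alpha, X_I, W_I, l) \ge \Delta(l, l_I^*) - \xi_I, \;\; \forall I, \; \forall l \ne l_I^* \tag{8}$$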

where $I$ indexes over all training images. This is a convex optimization problem, but it has
hundreds of thousands of constraints (i.e. about 1000 candidate windows for each training
image, times about 500 training images per class in our experiments). We solve it efficiently
using quadratic programming with constraint generation [0]. This involves finding the most
violated constraint for the current $\alpha$. We can do this exactly, since we can solve the inference
problem (6) exactly and the loss $\Delta$ decomposes into a sum of terms which each depend on a
single window. Thanks to this, the constraint generation procedure will find the global optimum of (8) [0].
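
To illustrate what "finding the most violated constraint" means for one training image, the sketch below uses the common margin-rescaling form, in which the loss is simply added to each window's score before taking the maximum. This form, the assumption that scores for all windows are already available, and the name `most_violated` are illustrative choices and are not confirmed by the text above.

```python
from typing import Sequence, Tuple


def most_violated(scores: Sequence[float], overlap_with_best: Sequence[float],
                  best: int) -> Tuple[int, float]:
    """Return the window maximising loss + score, and the slack it implies.

    scores[l]            : score of window l under the current weights
    overlap_with_best[l] : overlap between window l and the best window `best`
                           (equal to 1 for l == best)
    """
    def augmented(l: int) -> float:
        # Loss is 1 - overlap, so poorly overlapping windows need a larger margin.
        return (1.0 - overlap_with_best[l]) + scores[l]

    worst = max(range(len(scores)), key=augmented)
    slack = max(0.0, augmented(worst) - scores[best])
    return worst, slack


scores = [0.5, 1.0, 0.25]
overlap_with_best = [1.0, 0.5, 0.0]
print(most_violated(scores, overlap_with_best, best=0))  # (1, 1.0)
```

Window 1 scores higher than the best window while overlapping it only partially, so it yields the most violated constraint. Because the loss adds one term per window, this search has the same structure as ordinary inference.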

5 Related work 

The first work to try to annotate object locations in ImageNet [O] addressed it as a
window classification problem, where a classifier trained on annotated images is then used to
classify windows in images with no annotation. Later [E3] proposed to regress the overlap
of a window with the object using GP [O], which allowed for self-assessment thanks to
the GP's probabilistic output. We build on [E3], using Associative Embedding to learn the kernel
between window appearance features and GP to predict the spatial relations between a
window and the object. Note how the model of [E3] is equivalent to ours when using only the
$\Phi_S$ hyper-feature.

Importantly, both [O, E3], as well as many other object localization techniques [S, O,
E3, E3, E0], score each window individually based only on its own appearance. Our work
goes beyond this by evaluating windows based on richer cues measured outside the window. This
is related to previous work on context [□, DU, O, HE, 123,120, ED] as well as to works that use
a structured output regression formulation [0, O, ED, E3]. We review these areas below.

Context. The seminal work of Torralba [ED] has shown that global image descriptors such
as GIST give a valuable cue about which classes might be present in an image (e.g.
indoor scenes are likely to have TVs, but unlikely to have cars). Since then, many object
detectors [□, O, E3, EE, E3, E3] have employed such global context to re-score their
detections, thereby removing out-of-context false-positives. Some of these works also
incorporate the region surrounding the object into the window classifier to leverage local
context [H, E3, El, E3].

Other works [□, DU, HE, E0] model context as the interactions between multiple object
classes in the same image. Rabinovich et al. [E0] use local detectors to first assign a class
label to each segment in the image and then adjust these labels by taking into account
co-occurrence between classes. Heitz and Koller [EE] exploit context provided by "stuff"
(background classes like road) to guide the localization of "things" (objects like cars).
Several works [0,111]] model the co-occurrence and spatial relations between object classes in
the training data and use them to post-process the output of individual object class detectors.
An extensive empirical study of different context-based methods can be found in [O].

The force driving those works is the semantic and spatial structure of scenes as
arrangements of different object classes in particular positions. Instead, our technique works on
a different level, improving object localization for a single class by integrating cues from
the appearance and spatial relations of all windows in an image. It can be seen as a new,
complementary form of context.

Localization with structured output regression was first proposed by [0]. They devised
a training strategy that specifically optimizes localization accuracy, by taking into account the
overlap of training image windows with the ground-truth. They try to learn a function which
scores windows with high overlap with ground-truth higher than those with low overlap.
The approach was extended by [E3] to include latent variables for handling multiple aspects
of appearance and truncated object instances. At test time an efficient branch-and-bound
algorithm is used to find the window with the maximum score. Branch-and-bound methods
for localization were further explored in [EE, ED].

Importantly, the scoring function in [0, EE, ED, EE] still scores each window in the test
image independently. In our work instead we score each window in the context of all other
windows in the image, taking into account their similarity in appearance space as well as
their spatial relations in the image plane (sec. 4). We also use structured output
regression [EZD] for learning the parameters of our scoring function (sec. 4.2). However, due to
interaction between all windows in the test image, our maximization problem is more
complex than in [0, EE], making their branch-and-bound method inapplicable. Instead, we devise
an early-rejection method that uses the particular structure of our scoring function to reduce
the number of evaluations of its most expensive terms (sec. 4.1).


6 Experiments and conclusions 

We perform experiments on the subset of ImageNet [0] defined by [O, DID, E3], which
consists of 219 classes for a total of 92K images with ground-truth bounding-boxes.
Following [E3], we split them into two disjoint subsets of 60K and 32K for training and testing
respectively. The classes are very diverse and include animals as well as man-made objects
(fig. 5). The task is to localize the object of interest in images known to contain a given
class [O, DID, E3]. We train a separate model for each class using the corresponding images
from the training set.

Features. For our method and all the baselines we use the same features as the AE-GP+
method from [E3]: (i) three ultra-dense SIFT bag-of-words histograms on different color
spaces [E3] (each 36000 dimensions); (ii) a SURF bag-of-words from [E3] (17000
dimensions); (iii) HOG [S] (2048 dimensions). We embed each feature type separately in a
10-dimensional AE [E3] space. Next, we concatenate them together and add location and scale
features as in [E3, E3]. In total, this leads to a 54-dimensional space on which the GP
operates, i.e. only 54 parameters to learn for the GP kernel.
Figure 4: Mean overlap curves. We retain the single top scoring candidate window in a test image
and measure the mean overlap of the output windows with the ground-truth. We vary a threshold on
the score of the output windows to generate performance curves. The higher the curve, the better.

6.1 Baselines and competitors 

MKL-SVM. This represents a standard, classifier-driven approach to object localization,
similar to [E3, El]. On 90% of the training set we train a separate SVM for each group
of features described above. We combine these classifiers by training a linear SVM over
their outputs on the remaining 10% of the data. We also include the location and scale
features of a window in this second-level SVM. This baseline uses exactly the same candidate
windows [IZ3] and features as our method (sec. 6).

UVA [E3]. The popular method [E3] can be seen as a smaller version of the MKL-SVM
baseline we have just defined. In order to make a more exact comparison to [E3], we remove
the additional features and use only their three SIFT bag-of-words. Moreover, instead of
training a second-level SVM, we simply combine their outputs by averaging. This
corresponds to [E3], but using the recent state-of-the-art object proposals [IZ3] instead of selective
search proposals. This method [E3] is one of the best performing object detectors. It won
the ILSVRC 2011 [ffl] detection challenge and the PASCAL VOC 2012 detection challenge.

AE-GP [E3]. Finally, we compare to the AE-GP+ model of [E3]. It corresponds to a
degenerate version of our model which uses only $\Phi_S$ to score each window in isolation by looking
at its own features (5). This uses the same candidate windows [IZ3] and features we use. This
technique was shown to outperform earlier work on location annotation in ImageNet [O].

6.2 Results 

For each method we retain the single top scoring candidate window in a test image. We
measure localization accuracy as the mean overlap of the output windows with the
ground-truth [E3] (fig. 4). We vary a threshold on the score of the output windows to generate
performance curves. The higher the curve, the better.
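
A mean overlap curve of this kind can be computed by sorting the per-image outputs by score and averaging the overlap of the top-scoring fraction. The sketch below assumes one `(score, overlap)` pair per test image; the function name `mean_overlap_curve` and the toy numbers are illustrative only.

```python
from typing import List, Sequence, Tuple


def mean_overlap_curve(scores: Sequence[float],
                       overlaps: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (fraction of images returned, mean overlap) for every score threshold.

    scores[i]   : score of the top window selected in image i
    overlaps[i] : overlap of that window with the ground-truth box
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    curve, running = [], 0.0
    for k, i in enumerate(order, start=1):
        running += overlaps[i]
        # Lowering the threshold to scores[i] returns the k highest scoring images.
        curve.append((k / len(scores), running / k))
    return curve


scores = [0.9, 0.2, 0.6, 0.4]
overlaps = [0.75, 0.25, 0.5, 0.5]
for fraction, mean_overlap in mean_overlap_curve(scores, overlaps):
    print(f"{fraction:.2f} of images -> mean overlap {mean_overlap:.3f}")
```

Reading a result such as "0.75 mean overlap at 35% of images" then amounts to finding the largest fraction whose mean overlap stays at or above 0.75.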

As fig. 4 shows, the proposed method consistently outperforms the competitors and the
baseline over the whole range of the curve. Our method achieves 0.75 mean overlap when
returning annotations for 35% of all images. At the same accuracy level, AE-GP, MKL-SVM
and UVA return 28%, 9% and 4% of the images respectively, i.e. we can return annotations
for 7% more images at this high level of overlap than the closest competitor. Producing very
accurate bounding-box annotations is important for their intended use as training data for
various models and tasks. Improving over AE-GP validates the proposed idea of scoring
candidate windows by taking into account spatial relations to all other windows and their
appearance similarity. The favourable comparison to the excellent method UVA [E3] and to
its extended MKL-SVM version demonstrates that our system offers competitive performance.
Example objects localized by our method and by AE-GP are shown in fig. 5. Our method
successfully operates in cluttered images (guenon, barrow, skunk). It can find camouflaged
animals (tiger), small objects (buckle, racket), and deal with diverse classes and high
intra-class variation (pool cue, buckle, racket).

Figure 5: Qualitative results. Results of our method (red) and AE-GP [E3] (green). Note how our
method is able to detect small, off-center objects despite occlusion (pool cue) or the object blending
with its surroundings (tiger).

To evaluate the impact of our fast inference algorithm (sec. 4.1) we compared it to brute 
force (i.e. evaluating the energy for all possible configurations) on the baseball class. On 
average brute force takes 17.6s per image, whereas our fast inference takes 0.14s. Since our 
inference method is exact, it produces the same solution as brute force, but 124x faster. 


6.3 Conclusion 

We have presented a new method for annotating the location of objects in ImageNet, which
goes beyond considering one candidate window at a time. Instead, it scores each window in
the context of all other windows in the image, taking into account their similarity in
appearance space as well as their spatial relations in the image plane. As we have demonstrated on
92K images from ImageNet, our method improves over some of the best performing object
localization techniques [E3, E3], including the one we build on [E3].